Experiments in Data Compression IV (or "The Need for Speed")

by John Kormylo
E.L.F. Software


Binary Search Tree

The LZA algorithm uses a binary search tree to locate strings. Specifically,
it consists of an array of structures of the form

typedef struct tree {
    struct tree *parent;    /* binary tree pointers, or NULL */
    struct tree *child1;    /* less than */
    struct tree *child2;    /* greater than */
    unsigned char *text;    /* string pointer */
    unsigned int count;     /* number of children + 1 */
    int first;              /* number of bytes common with parent */
} TREE;

with one structure for every string in the dictionary. (Note, these are not
NULL terminated strings, but it is easier to say "string" than "block of 65
consecutive bytes.") One would do a string compare against 'text' and move to
'child1' or 'child2' and repeat. The 'parent' entry was used for tree
balancing, but turns out not to be necessary. The 'count' entry was used both
for tree balancing and for coding, and the 'first' entry was an attempt to
reduce the number of byte comparisons during the string compares. (However,
nothing guarantees that child strings will have ANY bytes in common.)

To locate the longest match one must search until a NULL pointer is hit.
Given the first (in search order) match for a given length, one must search
all child strings to locate the most recent matching string (for the FIFO
buffer) or search for the lowest and highest matching strings (for the
dictionary).


FIFO Buffer - Hash Table

Actually, the method developed in the last report was more of a cross between
a hash table and a binary search tree. Specifically, it consisted of 16 byte
structures of the form:

typedef struct hash {
    struct hash *child1;    /* less than */
    struct hash *child2;    /* greater than */
    struct hash *next;      /* start of next tree */
    unsigned int sum;       /* sum of all matching strings */
    unsigned char index;    /* index of most recent match */
    unsigned char test;     /* character */
} HASH;

Given a pointer to the start of the table, one would search for the first
byte in the string using a binary search tree via 'child1', 'child2', and
'test'. Once you have found the first byte, you can search for the second
byte using the binary search tree which starts at 'next'. 'Sum' is used by
the tree balancing algorithm, and 'index' is the array index for the
corresponding string in the FIFO buffer.

The number of structures used is variable, with the worst case being
64 * 65 * sizeof(HASH) = 65K bytes, since there are 64 strings in the FIFO
buffer and each string has at most 65 bytes. The structures are initialized
as a simple linked list of free structures. Also, you will need an array of
64 string pointers and weights. (The hash table is used for coding only.)

The advantage of this approach is that we have replaced string compares with
byte compares. Once the first byte is found, one need only search for the
second byte, etc. Also, the 'index' value is automatically reset each time a
string is added to the hash table, so one need not search for the most recent
match.
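To make the byte-by-byte search concrete, here is a minimal sketch of how the
longest match for a string could be located in this structure. It assumes
'start' points to the tree for the first byte; the function name and
arguments are illustrative, not taken from the actual ELFBACK source.

/* Illustrative sketch only: walk one binary tree per byte position,
   following 'next' to reach the tree for the following byte. */
static int longest_match(HASH *start, unsigned char *str, int max_len,
                         int *index)
{
    HASH *p = start;
    int len = 0;

    while (p && len < max_len) {
        unsigned char byte = str[len];
        while (p && p->test != byte)    /* binary search within this layer */
            p = (byte < p->test) ? p->child1 : p->child2;
        if (!p)
            break;                      /* no string continues with this byte */
        len++;                          /* one more byte matched */
        *index = p->index;              /* most recent string with this prefix */
        p = p->next;                    /* tree for the following byte */
    }
    return len;                         /* number of bytes matched */
}

On a match of non-zero length, the returned 'index' already identifies the
most recent FIFO entry, precisely because 'index' is overwritten whenever a
string is added to the hash table.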
Main Dictionary - Layered Binary Search Tree

This approach combines the speed of a hash table with the low memory
requirement of a binary search tree. As before, you need an array of
structures, one for each string in the dictionary. In addition you will need
a variable number of structures which point to the same strings, but are
located in lower layers. The worst case is also the same as the number of
strings in the dictionary, for a total of twice the RAM needed for a simple
binary search tree.

The structures used here are given by:

typedef struct search {
    struct search *child1;  /* less than */
    struct search *child2;  /* greater than */
    struct search *next;    /* start of next tree */
    unsigned char *text;    /* string pointer */
    unsigned int sum;       /* sum of all children */
    unsigned char len;      /* number of common chars */
    unsigned char test;     /* first uncommon char */
} NODE;

The 'len' parameter is constant over a layer. The 'test' entry is redundant,
since p->test = p->text[p->len], but it is faster to access and is free
(since structures always take an even number of bytes).

Given a pointer to the start of the tree, one would compare the first byte
against 'test' and perform the usual binary search via 'child1' and 'child2'.
If a match is found, 'next' points to the start of another binary search
tree, but not necessarily for the second byte. One would first have to check
all the bytes in p->text[1] through p->text[p->len - 1] (where p points to
the first entry in the next layer). If they all match, one can then search
for a matching 'test' using the binary tree for this layer, etc.

The advantage is again that you need only search for each byte once. Also,
when you find a match of a given length, the number of matching entries is
given either by 'sum' (if you are between layers), by 'next->sum' (if it
exists), or is 1.

The trick is to copy the parent structure (at least, it uses the same 'text'
pointer) whenever you start a new layer, then add the new string as a child
structure of the copy. Similarly, whenever you remove a string from the
dictionary, you have to search the next layer to find the duplicate, replace
the parent with a structure from the next layer, and replace the duplicate
with a copy of the new parent (unless there are no more entries on that
layer). It should be remembered that the duplicate to be removed might also
have a duplicate in the next layer.


Tree Balancing

Consider the two binary search trees organised as follows:

    A                C
     \              /
      C            A
     /              \
    B                B

where A < B < C by value. It should be noted that any additional children of
A, B or C would remain in place if one were to convert from one to the other.
This transformation is called a rotation.

Starting with A using the left tree, suppose we wanted to add a new node
D > C. Starting at A we see that we need to move to the right. However,
before doing so we check to see if the number of children to the right
(child2->sum) is greater than the number of children to the left
(child1->sum). If so, we perform the rotation. Since we are now at C instead
of A, we need to repeat the comparison. Also, before moving down we should
increment the 'sum' entry at C.

One can avoid the need for a parent pointer by using the approach:

HASH *p, **from;

from = &start;
p = *from;
while (p) {
    ...
    if (byte < p->test)
        from = &(p->child1);
    else if (byte > p->test)
        from = &(p->child2);
    else
        from = &(p->next);
    p = *from;
}

Changing '*from' will change the appropriate pointer, no matter where it came
from.
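As an illustration, here is one way the rotation itself might be written
using the same '**from' trick, so that whatever pointed at A afterwards
points at C. It assumes 'sum' holds a node's own weight plus the totals of
both subtrees; the function name and that bookkeeping are assumptions, not
the actual ELFBACK code.

/* Illustrative sketch only: rotate the right child C up above A, as in the
   diagram above. */
static void rotate_up(HASH **from)
{
    HASH *a = *from;                /* node A, the current subtree root */
    HASH *c = a->child2;            /* node C, about to move up */
    unsigned int a_total = a->sum;  /* total for the whole subtree */
    unsigned int b_total = c->child1 ? c->child1->sum : 0;  /* subtree B */

    a->child2 = c->child1;          /* B (or nothing) moves under A */
    c->child1 = a;                  /* A becomes the left child of C */
    *from = c;                      /* whatever pointed at A now points at C */

    a->sum = a_total - c->sum + b_total;  /* A keeps its left subtree and B */
    c->sum = a_total;               /* the overall total is unchanged */
}

In the insertion loop above, such a rotation would be performed just before
descending to 'child2', after which the comparison is repeated from C as
described.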
Fine Tuning

The following table shows the results for various methods of choosing which
matches to use. The first number is the time required to perform the
compression (accurate to 2 seconds) and the second is the size of the
compressed file. The 'ZIP' entry is used for comparison (times are not
available).

       | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
ZIP    |        25134 |       32417 |       496876 |
Smart  |   18   22284 |  22   32567 |  418  503524 |
Best   |  102   22029 |  64   31997 | 1670  453142 |
Best1  |   76   21746 |  54   32451 | 1654  446201 |
Best2  |   70   21960 |  50   32372 | 1454  446607 |
-------+--------------+-------------+--------------+
8K     |   20   22444 |  22   32405 |  424  496712 |
384    |   60   21979 |  40   32417 | 1190  419201 |
8K384  |   60   22133 |  40   32288 | 1164  421197 |
-------+--------------+-------------+--------------+

The 'Smart' method is essentially the same as that developed in the previous
report. It looks for the longest combination of two strings. To decide
between a string from the FIFO buffer and one from the main dictionary, each
match is ranked by the weight for that offset times the total number of
entries in the main dictionary, or by the sum of matches in the main
dictionary times the sum of all the weights for the FIFO buffer. To decide
between two combinations of strings which have the same combined length, it
looks at the rank times the length of the first match. This is the fastest
method which is still competitive with ZIP.

The 'Best' method looks at all possible ways to compress the next 65 bytes
until a unique first step is determined. Analysis by the Pure Profiler showed
that most of the time is now spent computing the number of bits needed to
code strings, rather than searching for them.

'Best1' is a variation of the 'Best' algorithm which only codes a single byte
using arithmetic coding if there is no other way to get to the next byte.
'Best2' also checks whether the number of bits used to get to the end of the
longest combination found so far is less than the number of bits needed to
get to the current byte, in which case it doesn't bother to test whether any
strings starting from the current byte will help. While these variations
achieve only marginal improvements in speed, there is very little loss of
compression, and sometimes they even out-perform the 'Best' algorithm.

The best thing about all these methods is that they are totally compatible
with the current versions of ELFARC and ELFBACK. However, if we relax this
requirement, additional savings are possible.

'8K' shows the 'Smart' algorithm results when the main dictionary is reduced
from 16K to 8K entries (640K to 320K bytes). Note that this method beats ZIP
on all three files.

'384' is a variation of 'Best2' which uses 384 codes instead of 320.
Specifically, these consist of the 256 basic codes, 64 length codes for the
FIFO buffer and 64 length codes for the main dictionary. This was intended to
reduce the number of steps needed to code a string (or to compute the number
of bits needed to code a string). Whether it benefits or hurts the
compression depends on whether the distributions of lengths are similar for
the two dictionaries. '8K384' uses both an 8K main dictionary and the 384
codes. Interestingly, using 384 codes had no benefit for the 'Smart'
algorithm whatsoever.

I also discovered that a very minor improvement in speed is possible by
sorting the codes in the arithmetic compression so that the most common codes
are close together.


Conclusion

At this point the difference in compression efficiency between the Smart and
Best approaches is minor compared to the difference in speed. Since the '8K'
method satisfies the design criteria, it will be used for the next versions
of both ELFBACK and ELFARC.